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Ten-million-atom electronic structure calculations on the K computer 
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A massively parallel order-W electronic structure theory was constructed by an interdisciplinary research between 
physics, applied mathematics and computer science. (1) A high parallel efficiency with ten-million-atom nanomateri- 
als was realized on the K computer with upto 98,304 processor cores. The mathematical foundation is a novel linear 
algebraic algorithm for the generalized shifted linear equation. The calculation was carried out by our code 'ELSES ' 
(www.elses.jp) with modelled (tight-binding-form) systems based on ab initio calculations. (2) A post-calculation analy- 
sis method, called ;r-orbital crystalline orbital Hamiltonian population (;r-COHP) method, is presented, since the method 
is ideal for huge electronic structure data distributed among massive nodes. The analysis method is demonstrated in an 
sp^-sp^ nano-composite carbon solid, with an original visualization software 'VisBAR'. The present research indicates 
general aspects of computational physics with current or next-generation supercomputers. 

KEYWORDS: order-N electronic structure theory, numerical linear algebra, massive parallel computation, sp^-sp^^ 
nano-composite carbon solid, crystalline orbital Hamiltonian population. 
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A common issue in current computational physics is the 
theory for a large calculation with modern massively par- 
allel supercomputers, like the K computer. A large calcu- 
lation should accompany a large-data analysis theory, as a 
post-calculation tool, so as to obtain a physical conclusion 
from huge numerical data distributed among massive com- 
puter nodes. Interdisciplinary researches between physics, 
applied mathematics and computer science are crucial in 
this field and are sometimes called 'Application- Algorithm- 
Architecture co-design' . 

The present paper is devoted to the theories, both for large- 
scale calculation and large-data analysis, as order-A^ elec- 
tronic structure theories suitable for modern massively par- 
allelized computers. 

Order- electronic structure theories are those in which the 
computational time is proportional to the number of atoms 
in the system and are promising methods for large-scale 
calculation. The reference list can be found, for example, 
in Ref..' Recently, several novel linear algebraic algorithms, 
with Krylov subspace, were developed for the order-A^ the- 
ory, /. e. generalized shifted conjugate-orthogonal conjugate- 
gradient method,'"^ generalized Lanczos method,' gener- 
alized Amoldi method, ' Arnoldi (M, W, G) method,^ mul- 
tiple Arnoldi method,'' generalized shifted quasi-minimal- 
residual method.^ Their common foundation is the 'general- 
ized shifted linear equation', or the set of linear equations 



{zS -H)x^b. 



(1) 



Here z is a (complex) energy value and the Hamilto- 
nian and overlap matrices are denoted as H and S in the 
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linear-combination-of-atomic -orbital (LCAO) representation, 
respectively. They are sparse real-symmetric M x M matri- 
ces and S is positive definite. The vector b is an input and 
the vector x is the solution vector. Equation (1) is solved, 
iteratively, instead of the generalized eigen-value equation 
(Hyk - siiSyk)- The method is purely mathematical and may 
be useful also in other physics fields. Researches in the case 
of S = / are found, for examples, in Quantum Chromody- 
namics'' and many-body electron theory.^ 

The multiple Arnoldi method is used in this paper, since it 
is suitable for a molecular-dynamics (MD) simulation, or the 
calculation of energy and force. ^ See the paper^ for details. 
In short, the Green's function G = [zS - i/) ' is generated in 
the LCAO representation and the computation is parallelized, 
in the most procedures of the code, by the column index j of 
the Green's function (G,;).^ The physical quantities, such as 
energy and force, are calculated through the one-body density 
matrix defined by 



f{s - ij) Im Gij(s -H iO) ds, 



(2) 



where / is the occupation number in the Fermi-Dirac function 
with the chemical potential yU. The method is applicable both 
to insulators and metals. The method was implemented in a 
simulation package 'ELSES' (www.elses.jp) with modelled 
or tight-binding(TB)-form Hamiltonians based on ab initio 
calculations. Several calculations were carried out with the 
charge-self-consistent formulation.*^ 

In a previous paper,^ the order-A^ scaling with upto ten- 
million atoms was confirmed, as shown in Fig. 1 (a) and a 
parallel calculation was caiTied out with 10^ cores of Intel 
Xeon processors. The present paper will report a calculation 
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Fig. 1. (a) Benchmarlc for the order-A? scaling.^ (b) Benchmark for the par- 
allel efficiency (strong scaling) with lO' atoms on the K computer among P 
= 4,096 - 98,304 processor cores. See the text for details. 



with 10^ processor cores on the K computer and one will find 
several computational issues clarified in 10^-core calculation. 

In the present paper, the parallel efliciency is measured on 
the K computer for given materials with ten-million atoms. 
The resultant benchmark is called 'strong scaling' in the high- 
performance computation society. The numbers of the used 
processor cores are from P - f = 4, 096 to f = P^ax = 
98, 304. The MPI/OpenMP hierarchical parallelism was used 
and the parallel unit for the MPI or OpenMP parallelism is 
called 'node' or 'thread', respectively, throughout the present 
paper. The number of threads is fixed to be pChi^ad) _ jj^g 
maximum value on the K computer and the number of nodes 
is given by f = p/p(thiead) ^ p/g j.^^^ calculations were 
carried out in a couple of MD steps, with a TB-form Hamil- 
tonian,*^ for amorphous-like conjugated polymer, poly-(9,9 
dioctil-fluorene) (aPF) with A^=10,629,120 atoms^ and sp^- 
sp^ nano-composite carbon solid (NCCS) with A^= 10,32 1 ,920 
atoms (See Ref.'" for a related study). 

As a result, a high parallel efficiency is found in Fig. 1 
(b). The measure for the efficiency is obtained as a s 
T(Pn,in)/T(P^,,) X (Pniax/^min) = 0.79 or a = 0.95 for the 
aPF or NCCS case, respectively, where T(P) is the elapse 
time with P cores per MD step. In general, the elapse time 
increases with the increase of the number of orbitals per atom 
(wiorb), as well as the number of atoms (A^). The elapse time at 
the maximum cores is r(fniax)=39.1 sec for the NCC system 
and r(f niax)=15.3 sec for the aPF system, since the NCCS 
system has four (s- and p-type) orbitals per atom (iriorh = 4) 
and the aPF system contains hydrogen atoms with single (s- 
type) orbital and has, on average, only 2.3 orbitals per atom 
('Worb = 2.3). The NCCS system was calculated also in a TB- 
form Hamiltonian with additional d orbitals (mo,b - 9)." The 
resultant elapse time is r(Pmax)=311.0 sec and is larger than 
that without the d orbital (r(Pmax)=39.1 sec), as should be. 

Here several computational issues are discussed for (I) 
data-communication saving (II) memory-size saving and (III) 
parallel file reading/writing, since they are crucial for a high 
performance with practical computational resources. These 
issues affect the elapsed time and/or the required memory size 
but does not affect the resultant values of physical quantities. 
(I) The inter-node data communication is suppressed by the 



redundant calculation of several quantities, such as the ma- 
trix elements of H and S , among the nodes. ^ (II) A workflow 
for the memory saving was built in the code and the work- 
flow saves the required memory size drastically, at the sacri- 
fice of a moderate increase of the time cost. In the memory- 
saving workflow, the data array for the Green's function (G), 
the largest data array, is not stored in the memory but calcu- 
lated twice redundantly, as follows: 

(//,5) ^ G ^ju (2ndcalc.ofG) ^p. (3) 

The first calculation of the Green's function is carried out be- 
fore the determination of the chemical potential ju in the bi- 
section method. After that, the second calculation is carried 
out, since the density matrix p is generated, as in Eq. (2), both 
from the Green's function G and the chemical potential ju. In 
a result with 10^ atoms by a single-node workstation, the con- 
sumed memory size is reduced drastically from 28 GB in the 
non-memory-saving workflow into 1.6 GB in the memory- 
saving workflow, while the increase of the time is moderate 
(19 %). The reduction of the required memory is important, 
since the built-in memory size of the K computer is only 16 
GB per node. All the calculations in Fig. 1 were carried out in 
the memory-saving workflow. (Ill) Out test calculation shows 
that the parallel file reading can give an important acceler- 
ation and was realized with split input files. A typical file 
size with 10^ atoms is one G byte (B) for our input atomic- 
structure data described in the extensible markup language 
(XML) format.'^ When the file with atoms is split into 
Hspiit files, each split file contains the data with approximately 
A^/wspiit atoms. The file writing is also parallelized, when each 
node saves the atomic structure data in the split XML format. 
As a result with 4,096 cores (512 nodes), the consumed time 
for the initial procedures Ti„i, or the procedures before start- 
ing the electronic structure calculation, is drastically reduced, 
from Tini - 1 , 423 sec with the non-parallel file reading into 
T^ini - 69 sec with the parallel file reading.'-' The file writing 
procedure consumes a tiny time cost ( 0.2 - 0.4 sec). It is note- 
worthy that in most MD simulation studies, the file writing 
procedure of the atomic structure is carried out with a certain 
interval of MD steps, typically 10 steps. Therefore, the file 
writing time will be negligible in the whole simulation time. 

Now the discussion is turned into the second topic, the post- 
calculation analysis with huge electronic structure data. 

The present paper presents analysis methods based on the 
Green's function, /. e. crystalline orbital Hamiltonian popu- 
lation (COHP) method, and its theoretical extension called 
;7rCOHP method. The original COHP reveals the local bond- 
ing nature for each atom pair energetically and its definition 
is 

Cij(e) = —J] Hjp,i^lm Gi^.jpiE + iO), (4) 

where the basis suffices, previously denoted by / and j, are 
decomposed into the atom suffices, / or 7, and the orbital suf- 
fices, a or j0 (; = (I, a), j = {J,/3)). The energy integration of 
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Fig. 2. Bond visualization of sp^-sp' nano-composite carbon solid (NCCS) with lO'' atoms by the COHP and wCOHP analysis. The bonds are visualized 
both for the sp- and sp-' domains, or both for o" and n bonds in (a), while the bonds are visualized only for the sp^ domain or for ti bonds in (b). A closeup of 
a sp^-sp' domain boundary is shown in (c). 



COHP, caUed ICOHP, is also defined as 

B„ = J f(s-^i)CiAe)de. (5) 

The sum of ICOHP gives the electronic structure energy or 
the sum of the occupied eigen levels; 

^elec = ^ /(Si - yU) £^ = ^ B/y. (6) 

k 1,1 

A large negative value of Bu indicates the energy gain for the 
bond formation between the (/, J) atom pair. In the original 
paper,''* the analysis is carried out for ab initio calculations in 
the linear-muffin-tin-orbital formulation.'^ The method was 
applied not only to insulator but also to metal, such as in 
Ref.. '^ The method is suitable for the present order-A^ method, 
since the present method is based on the Green's function.'^ 

Here the ttCOHP is proposed as a theoretical extension of 
the original COHP. If the Hamiltonian contains the s- and p- 
type orbitals, for example, the off-site Hamiltonian term is 
decomposed into the cr and n components {Hjpja - H^jp-ia 
Hf^.jJ. The ttCOHP, CfJ, will be defined in Eq. (4), when 
Hjjsja is replaced by Hf^.^^. The ttICOHP, Bfj, will be defined 
in Eq. (5), when C/y is replaced by CfJ. The crCOHP, Cj'J"', 
and the crICOHP, B^^', are defined in the same manners. From 
the definitions, the (I)COHP is decomposed into the sum of 
the cr(I)COHP and 7r(I)COHP 

C,j(s) = Cfj>(s) + CfJ(E) (7) 

Bu = Bf; + BfJ. (8) 

Hereafter the original (I)COHP is called as 'full' (I)COHP In 



the code, the full, cr and 7r(I)COHP can be calculated auto- 
matically, without any additional data communication, during 
the massively parallelized order-A^ calculation. One can dis- 
tinguish the cr and n bonds, energetically, by the cr(I)COHP 
and ;r(I)COHP. A large negative value of B*^' or B^j indicates 
the croTTT bond formation between the atom pair, respectively. 

In Fig. 2, the (:7r)COHP analysis is demonstrated in an 
NCCS system with 10^ atoms, so as to distinguish the sp^ 
and sp^ domains, because one can distinguish the sp^ do- 
mains from the sp^ domains by the presence of n bonds. 
The system is a resultant structure of the our previous finite- 
temperature MD simulation with a periodic boundary condi- 
tion.'" The simulation is a preliminary research for the for- 
mation process of the nano-polycrystalline diamond (NPD), 
a novel ultra-hard materials. "^ '^ The NPD is produced di- 
rectly from graphitic materials and consists of 10-nm-scale 
diamond-structure domains with a characteristic lamella-like 
structure. Its growth process is of great interest and possi- 
ble precursor structures should be a ten-nm-scale composite 
between the sp^ (graphite) and sp^ (diamond) domains. The 
present research is motivated from the above problem, though 
the present structures, still, have a gap in the length scale, 
when it is compared with experiments, as discussed later 

Figure 2 (a) or (b) shows the bond visualization with the 
full ICOHP or the ;rICOHP respectively. In Fig. 2(a), a bond 
is drawn for an (/, J) atom pair, when its ICOHP value sat- 
isfy the condition of Bu < Bth with a given threshold value 
of Bth(< 0). We found a typical value of B± - -9 eV. The 
visualization with the full ICOHP indicates the visualization 
both for sp^ and sp^ bonds. In Fig. 2(b), on the other hand, a 
7T bond is drawn, when its n ICOHP value satisfy the condi- 
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tion of fif/ < Bjf with a given threshold value of Bf(< 0). 
We found a typical value of B^^^ = -1.5 eV. The visuaUza- 
tion with the ;rICOHP indicates the visualization only for sp^ 
bonds. 

The bond visuaUzation analysis in Figs. 2(a) and (b) con- 
cludes that the layered domains are sp^ domains and the non- 
layered domains are sp-' domains. Figure 2(c) shows the visu- 
aUzation of a boundary region between sp^ and sp^ domains. 
Here one can confirms that a layered domain form an sp^ or 
graphite-like structure and a non-layered domain form an sp^ 
or diamond-like structure, as expected from the (7r)ICOHP 
analysis. 

Several points are discussed for the (7r)COHP analysis. (I) 
The value of is much smaller than that of |£thl, because 
the n bonding is much weaker than the cr bonding. (II) The 
threshold values for the bond visualizations, Z?th and B^^, may 
not be universal among materials but is independent on the 
system size. One should choose a typical value once for a ma- 
terial and can use the value among different system sizes. (Ill) 
One should be careful, sometimes, in the interpretation of the 
analysis result, because the ;rCOHP analysis dose not detect 
an sp^ bond but detects a contribution of the 7r-bonding com- 
ponent, as explained above. For example, the initial structure 
for the simulation of Fig. 2 contains defects among the sp^ 
domains,'" as 'seeds' of the sp^-sp-' domain boundary. The 
resultant structure in Fig. 2 still has several initial defects in 
the sp^ domains and n bonds are often drawn in Fig. 2(b) at 
such local defective regions. Such a n bond does not mean an 
sp^ bond. (IV) The structure of Fig. 2 has a gap in the length 
scale, when it is compared with experiments. The structure is 
a '2D-like' one, because the periodic cell length in the perpen- 
dicular direction to the paper (2nm) is much smaller than the 
other two cell lengths, 17 nm or more.'" The artificial '2D- 
like' situation aff'ect severely the resultant atomic structure 
and makes a difficulty for a direct comparison between the 
simulation and experiment. A more realistic situation with the 
ten-nanometer simulation cell sizes in the three directions re- 
quires million-atom MD simulation, ten times larger than that 
in the present result.'" Such a larger MD simulation may be a 
possible target in near future with the parallel computations. 
(V) The bond visualization of Figs. 2(a)-(c) was realized by 
our original visualization tool 'VisBAR'(=Visualization tool 
with Ball, Arrow and Rods). The tool is based on Python 
(www.python.org) and was developed for our needs in the 
large-scale calculations, like the (;r)COHP analysis. 

In summary, (i) a high parallel efliciency was found in ten- 
million-atom order-A' electronic structure calculations on the 
K computer with approximately 10^ processor cores. Impor- 
tant computational issues are addressed for communication, 
memory size and file reading/writing, (ii) The (tt) COHP anal- 
ysis method is presented as a practical post-calculation anal- 
ysis method ideal for the huge distributed data of the Green's 
function. The analysis is demonstrated in a sp^-sp^ nano- 
composite carbon solid, so as to distinguish sp^ and sp' do- 
mains. The example shows a typical need of large-scale elec- 
tronic structure calculation that requires both large-scale cal- 
culation and large-data analysis with huge distributed data. 



The present research indicates general aspects of compu- 
tational physics, beyond electronic structure calculation, with 
current or next-generation supercomputers. Numerical algo- 
rithm and computer scientific methods will be inseparable 
from physics. A physical discussion should be constructed 
from physical quantities, Uke the Green's function or the 
(;7r-)COHP, that can be computed, analyzed and visualized 
with massively parallel computer architectures. All the meth- 
ods discussed here, ones for calculation, parallel file read- 
ing/writing with split XML file, memory saving workflow, 
post-calculation data analysis, and visualization, are designed 
to be suitable for the massively parallel computer architec- 
ture, and some of them may be useful in other computational 
physics fields. 
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